# THE USER CAN SELECT A SPECIES HERE!
# Alnus, Betula, Corylus or Poaceae (Alder, Birch, Hazel or Grasses)
species_sel <- "Betula"
threshold <- 1
In this project we want to evaluate the new pollen forecast module that uses real-time data for calibration. As new automatic pollen monitors are being deployed, real-time calibration of pollen forecasts from weather models becomes increasingly relevant.
This study uses historical Hirst measurements and weather model reforecasts to investigate various implementations and their potential for forecast improvements.
For detailed information about the final implementation in COSMO-1E please refer to this documentation page (ask MeteoSwiss for access): https://service.meteoswiss.ch/confluence/x/dYQYBQ
Three datasets are available for that endeavour:
We have four species available for this analysis:
The analysis will be carried out for each species individually, but for all stations and both years combined.
The observations are provided at 14 different stations, one of which is excluded from the analysis: Davos/Wolfgang is high up in the mountains, where pollen measurements are almost always zero.
The following settings are crucial and should always be remembered when running the chunks below:
Observations of low pollen concentrations are uncertain. Therefore, timestamps with modelled or measured values below the threshold can be removed from all three data sets for the numerical analysis using the variable in the first chunk of this vignette. This symmetrical exclusion is necessary to make sure that no artificial biases are introduced. For the analyses carried out for specific health-impact-based categories (i.e. concentration bins), only timestamps with measured values < threshold are removed (i.e. modelled values are allowed to be < threshold). This asymmetrical exclusion is justified for the investigation of relative changes in categorical metrics. In the time series plot (Figure 1) and the boxplot (Figure 5), values < threshold are not removed at all for better visibility. Whether values < threshold were removed or not is denoted in all figure captions and should give the reader a full picture of the data.
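The two exclusion rules can be sketched in base R as follows. This is a toy illustration: the data frame `dat` and its column names are assumptions, not the vignette's actual objects.

```r
# Hypothetical merged data set; column names are made up for illustration.
threshold <- 1
dat <- data.frame(
  obs         = c(0.2, 3.0, 5.0, 0.0, 8.0),
  baseline    = c(1.5, 0.4, 6.0, 0.0, 7.5),
  calibration = c(0.8, 2.5, 5.5, 0.1, 8.2)
)

# Symmetrical exclusion (numerical metrics): drop a timestamp as soon as
# ANY of the three series is below the threshold.
sym <- dat[dat$obs >= threshold &
           dat$baseline >= threshold &
           dat$calibration >= threshold, ]

# Asymmetrical exclusion (categorical metrics): drop only timestamps whose
# MEASURED value is below the threshold; modelled values may stay below it.
asym <- dat[dat$obs >= threshold, ]

nrow(sym)   # 2 rows survive the symmetrical filter
nrow(asym)  # 3 rows survive the asymmetrical filter
```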
First we compare the three data sets with the traditional ANOVA approach. Statistical inference (p-values, confidence intervals, ...) is only valid if the model assumptions are fulfilled. In this setting, this means the following (several paragraphs are quoted from Lukas Meier's ETH script for the Applied Statistics ANOVA course):
The first assumption is most crucial (but also most difficult to check). If the independence assumption is violated, statistical inference can be very inaccurate. In the ANOVA setting, the last assumption is typically not as important compared to a regression setting, as we are typically fitting “large” models.
In a QQ-plot we plot the empirical quantiles (“what we see in the data”) vs. the theoretical quantiles (“what we expect from the model”). The plot should show a more or less straight line if the distributional assumption is correct. By default, a standard normal distribution is the theoretical “reference distribution”.
They are definitely not, and we have to make some adjustments. For the following plot we log-transform the data to deal with the right-skewness. The best results were achieved by first log-transforming the data and then taking the square root.
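A minimal sketch of this check, using simulated right-skewed values in place of the vignette's actual concentration series:

```r
# Simulated right-skewed stand-in for the pollen data.
set.seed(1)
conc <- rlnorm(500, meanlog = 3, sdlog = 1)   # strongly right-skewed

# Transformation described above: log first, then the square root
# (+1 guards against log(0) for zero concentrations).
trans <- sqrt(log(conc + 1))

# QQ-plots against the standard normal reference distribution
oldpar <- par(mfrow = c(1, 2))
qqnorm(conc, main = "raw");        qqline(conc)
qqnorm(trans, main = "sqrt(log)"); qqline(trans)
par(oldpar)
```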
### Do the errors have mean zero? & Is the error variance constant?
The Tukey-Anscombe plot plots the residuals vs. the fitted values. It allows us to check whether the residuals have constant variance and whether the residuals have mean zero (i.e. they don't show any deterministic pattern). We don't plot the smoothing line as loess (and other) algorithms have issues when the same value is repeated a large number of times (jitter did not really help).
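A minimal Tukey-Anscombe sketch; the one-way `station` model and the simulated concentrations are stand-ins for the vignette's actual ANOVA fit:

```r
# Toy one-way ANOVA on transformed concentrations (names are assumptions).
set.seed(2)
df <- data.frame(
  station = factor(rep(c("A", "B", "C"), each = 50)),
  conc    = rlnorm(150, meanlog = rep(c(2, 3, 4), each = 50))
)
fit <- aov(sqrt(log(conc + 1)) ~ station, data = df)

# Residuals vs fitted values; deliberately no smoothing line (see above).
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Tukey-Anscombe plot")
abline(h = 0, lty = 2)
```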
If the data has some serial structure (i.e., if observations were recorded in a certain time order), we typically want to check whether residuals close in time are more similar than residuals far apart, as this would be a violation of the independence assumption. We can do so by using a so-called index plot where we plot the residuals against time. For positively dependent residuals we would see time periods where most residuals have the same sign, while for negatively dependent residuals, the residuals would “jump” too often from positive to negative compared to independent residuals.
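An index-plot sketch of this idea; the AR(1) series is a toy example of what positively dependent residuals would look like, not the vignette's residuals:

```r
# Residuals plotted in time order; "h" draws one spike per timestamp.
set.seed(3)
res_indep <- rnorm(200)
res_dep   <- as.numeric(arima.sim(list(ar = 0.8), n = 200))

oldpar <- par(mfrow = c(2, 1))
plot(res_indep, type = "h", ylab = "Residual", main = "independent")
plot(res_dep,   type = "h", ylab = "Residual", main = "positively dependent")
par(oldpar)

# Dependent residuals keep the same sign for longer stretches:
mean(rle(sign(res_dep))$lengths)
mean(rle(sign(res_indep))$lengths)
```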
Summary: residual analysis shows that the assumptions of standard statistical methods are violated even for log(daily values). It is therefore suggested to continue the analysis with robust and simple metrics.
General overview of the daily concentration values as represented in the three time series. Each plot shows one species in one year. The separate lines represent individual measurement stations.
First as a boxplot and second as a histogram.
The correlation between the model and the measurements can be calculated easily; the confidence intervals and p-values must then be adjusted for multiple comparisons. The corr.test function from the psych package conveniently offers this functionality.
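A base-R sketch of the same idea (psych::corr.test wraps this workflow): one correlation test per station, followed by a Holm adjustment of the p-values. The station names and toy data are assumptions.

```r
# One Spearman test per station on toy paired series.
set.seed(4)
stations <- paste0("S", 1:5)
pvals <- sapply(stations, function(s) {
  obs <- rlnorm(60)                        # toy measured series
  mod <- obs * rlnorm(60, sdlog = 0.3)     # toy modelled series
  cor.test(obs, mod, method = "spearman")$p.value
})
p_adj <- p.adjust(pvals, method = "holm")  # corr.test adjusts with Holm by default
round(p_adj, 4)
```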
Careful: the correlation coefficient methods have some serious shortcomings:
The correlation coefficient measures linear agreement–whether the measurements go up-and-down together. Certainly, we want the measures to go up-and-down together, but the correlation coefficient itself is deficient in at least three ways as a measure of agreement. (http://www.jerrydallal.com/LHSP/compare.htm)
A good summary of the methods and their shortcomings can be found here: https://www.statisticssolutions.com/correlation-Pearson-Kendall-spearman/
The well-established Bland-Altman method from clinical trials can be used here as well to compare the means and differences between data sets. If the points lie within the two SD lines for the differences, the data sets can be assumed to be strongly associated with each other.
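A base-R sketch of such a mean-difference plot; `obs`/`mod` are hypothetical stand-ins for measurement and model on the log scale:

```r
# Toy paired series on the log scale.
set.seed(5)
obs <- log1p(rlnorm(100, 3, 1))
mod <- obs + rnorm(100, sd = 0.3)

avg <- (obs + mod) / 2
d   <- mod - obs
lim <- mean(d) + c(-2, 2) * sd(d)          # mean difference +/- 2 SD

plot(avg, d, xlab = "Mean of model and measurement",
     ylab = "Difference (model - measurement)")
abline(h = mean(d))
abline(h = lim, lty = 2)

inside <- mean(d >= lim[1] & d <= lim[2])  # share of points within the limits
inside
```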
These plots allow us to observe the error for different concentration categories.
First, various metrics are compared where the pollen concentrations are considered a continuous numerical variable.
| R2 | Bias | SD | MAE | RMSE | MSLE | RMSLE | type |
|---|---|---|---|---|---|---|---|
| 0.3632910 | -43.28973 | 332.4087 | 151.9309 | 335.0764 | 0.4240048 | 0.6511565 | baseline |
| 0.3712509 | 12.48690 | 319.7278 | 143.3437 | 319.8366 | 0.4106100 | 0.6407886 | calibration |
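The metric definitions behind the table can be sketched as below; SD is taken to be the standard deviation of the error (consistent with Bias, SD, and RMSE in the table), and `obs`/`mod` are toy daily concentrations, not the study data:

```r
# Continuous verification metrics for paired observation/model vectors.
metrics <- function(obs, mod) {
  err <- mod - obs
  sle <- (log1p(mod) - log1p(obs))^2       # squared log error
  c(R2    = cor(obs, mod)^2,
    Bias  = mean(err),
    SD    = sd(err),
    MAE   = mean(abs(err)),
    RMSE  = sqrt(mean(err^2)),
    MSLE  = mean(sle),
    RMSLE = sqrt(mean(sle)))
}

obs <- c(10, 50, 200, 5, 120)
mod <- c(12, 40, 260, 2, 100)
round(metrics(obs, mod), 3)
```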
Second, the values will be converted into health-impact-based buckets. The impact classes are defined here: https://service.meteoswiss.ch/confluence/x/1ZG4 Now we can investigate various metrics that are typically used for categorical variables. The Kappa metric was chosen as the most meaningful metric for this analysis; it is explained here: https://towardsdatascience.com/multi-class-metrics-made-simple-the-kappa-score-aka-cohens-kappa-coefficient-bdea137af09c
| type | Accuracy | Kappa |
|---|---|---|
| baseline | 0.3559748 | 0.1998443 |
| calibration | 0.4578616 | 0.2617470 |
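For reference, accuracy and Cohen's kappa can be computed directly from a confusion table in base R (the vignette presumably obtains the same numbers from a package such as caret); `ref`/`pred` are toy class vectors:

```r
# Toy predicted vs reference impact classes.
lvls <- c("nothing", "weak", "medium", "strong", "verystrong")
ref  <- factor(c("weak", "weak", "medium", "strong", "medium", "weak"), levels = lvls)
pred <- factor(c("weak", "medium", "medium", "strong", "weak", "weak"), levels = lvls)

tab <- table(pred, ref)
acc <- sum(diag(tab)) / sum(tab)                       # observed agreement
p_e <- sum(rowSums(tab) * colSums(tab)) / sum(tab)^2   # agreement expected by chance
kappa <- (acc - p_e) / (1 - p_e)
c(Accuracy = acc, Kappa = kappa)
```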
To take into account that the health impact levels are ordered, we can treat them as numerical values from 0 (nothing) to 4 (very strong).
| MAE_baseline | MAE_calibration |
|---|---|
| 0.7515723 | 0.609434 |
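The ordinal treatment amounts to mapping the classes to 0..4 and taking the mean absolute class distance, as sketched below with toy vectors:

```r
# Toy predicted vs reference impact classes.
lvls <- c("nothing", "weak", "medium", "strong", "verystrong")
ref  <- factor(c("weak", "medium", "strong", "weak"), levels = lvls)
pred <- factor(c("medium", "medium", "weak", "weak"), levels = lvls)

# factor -> integer code 1..5, shifted to the 0..4 scale used in the text
mae_ordinal <- mean(abs((as.integer(pred) - 1) - (as.integer(ref) - 1)))
mae_ordinal
```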
The following table could be used in the appendix.
|                    | Reference: Event | Reference: No Event |
|---|---|---|
| Predicted: Event   | A | B |
| Predicted: No Event | C | D |

The formulas used here are (2x2 forms, assuming the sample prevalence): Sensitivity = A/(A+C), Specificity = D/(B+D), Pos Pred Value = A/(A+B), Neg Pred Value = D/(C+D), Prevalence = (A+C)/(A+B+C+D), Detection Rate = A/(A+B+C+D), Detection Prevalence = (A+B)/(A+B+C+D), Balanced Accuracy = (Sensitivity + Specificity)/2.
| Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | Precision | Recall | F1 | Prevalence | Detection Rate | Detection Prevalence | Balanced Accuracy | class | type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NA | 0.746 | NA | NA | 0.000 | NA | NA | 0.000 | 0.000 | 0.254 | NA | Class: nothing | baseline |
| NA | 0.968 | NA | NA | 0.000 | NA | NA | 0.000 | 0.000 | 0.032 | NA | Class: nothing | calibration |
| 0.276 | 0.891 | 0.655 | 0.621 | 0.655 | 0.276 | 0.388 | 0.429 | 0.118 | 0.181 | 0.583 | Class: weak | baseline |
| 0.309 | 0.938 | 0.790 | 0.644 | 0.790 | 0.309 | 0.445 | 0.429 | 0.133 | 0.168 | 0.624 | Class: weak | calibration |
| 0.372 | 0.854 | 0.526 | 0.756 | 0.526 | 0.372 | 0.436 | 0.304 | 0.113 | 0.215 | 0.613 | Class: medium | baseline |
| 0.620 | 0.555 | 0.379 | 0.769 | 0.379 | 0.620 | 0.470 | 0.304 | 0.189 | 0.498 | 0.587 | Class: medium | calibration |
| 0.550 | 0.794 | 0.344 | 0.899 | 0.344 | 0.550 | 0.424 | 0.165 | 0.091 | 0.263 | 0.672 | Class: strong | baseline |
| 0.405 | 0.866 | 0.373 | 0.881 | 0.373 | 0.405 | 0.388 | 0.165 | 0.067 | 0.179 | 0.635 | Class: strong | calibration |
| 0.333 | 0.940 | 0.388 | 0.926 | 0.388 | 0.333 | 0.359 | 0.102 | 0.034 | 0.087 | 0.637 | Class: verystrong | baseline |
| 0.685 | 0.940 | 0.566 | 0.963 | 0.566 | 0.685 | 0.620 | 0.102 | 0.070 | 0.123 | 0.813 | Class: verystrong | calibration |
The R package nparcomp implements a broad range of rank-based nonparametric methods for multiple comparisons (https://www.researchgate.net/publication/282206980_nparcomp_An_R_Software_Package_for_Nonparametric_Multiple_Comparisons_and_Simultaneous_Confidence_Intervals). The single-step procedures provide local test decisions in terms of multiplicity-adjusted p-values and simultaneous confidence intervals. The null hypothesis H0: p = 1/2 is rejected at the 5% level of significance for many pairwise comparisons. Whenever the p-value is >= 5%, the confidence interval contains 0.5 and the effect of the factor trap is not statistically meaningful. The estimator can also be interpreted as a proxy for the relative difference in median between model and measurements: if the estimator is > 0.5, the model tends to produce larger values than the measurements.
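The relative effect p(a, b) = P(a < b) + 0.5 P(a = b) that nparcomp estimates can be sketched in base R via ranks (without the multiplicity adjustment and simultaneous intervals the package provides); `meas`/`model` are toy stand-ins:

```r
# Toy measurement and model series.
set.seed(6)
meas  <- rlnorm(80, meanlog = 3.0, sdlog = 1)
model <- rlnorm(80, meanlog = 3.2, sdlog = 1)

rel_effect <- function(a, b) {
  r  <- rank(c(a, b))                  # mid-ranks handle ties
  rb <- r[length(a) + seq_along(b)]    # ranks of the b sample
  (mean(rb) - (length(b) + 1) / 2) / length(a)
}
rel_effect(meas, model)   # > 0.5 means the model tends to be larger
```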
##
## #------Nonparametric Multiple Comparisons for relative contrast effects-----#
##
## - Alternative Hypothesis: True relative contrast effect p is not equal to 1/2
## - Type of Contrast : Dunnett
## - Confidence level: 95 %
## - Method = Logit - Transformation
## - Estimation Method: Pairwise rankings
##
## #---------------------------Interpretation----------------------------------#
## p(a,b) > 1/2 : b tends to be larger than a
## #---------------------------------------------------------------------------#
##
| Taxon | Comparison | Estimator | Lower | Upper | pValue |
|---|---|---|---|---|---|
| Betula | p( measurement , baseline ) | 0.548 | 0.522 | 0.574 | 1.15e-04 |
| Betula | p( measurement , calibration ) | 0.584 | 0.558 | 0.610 | 4.31e-12 |
Reviewer 1 requested an additional graph displaying the changes of the tune factor over the course of a season. The following chunk loads the data and saves a line-plot:
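A hypothetical sketch of such a plot: one line per station showing the tuning factor across the season. Station names and the random-walk series are made up for illustration; the actual chunk loads the real data.

```r
# Toy tuning-factor series, one column per station.
set.seed(7)
days <- seq(as.Date("2021-03-01"), as.Date("2021-05-31"), by = "day")
tune <- sapply(c(Basel = 1, Zurich = 2),
               function(i) exp(cumsum(rnorm(length(days), sd = 0.05))))

matplot(days, tune, type = "l", lty = 1,
        xlab = "Date", ylab = "Tuning factor")
legend("topleft", legend = colnames(tune), col = 1:2, lty = 1)
```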
Final version of the model:
Version 2 below had the following settings:
Version 4 has the same setup as the final version except:
Version 6 had the same setup as the final version, but the changes to the tuning factor were calculated based on the past 24 hours only:
Reviewer 2 requested a similar graph, faceted by station, but for the concentrations.
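A hypothetical base-graphics sketch of that faceted plot; the station names and both series are made up for illustration:

```r
# One panel per station, measured vs modelled toy concentrations.
set.seed(8)
days     <- seq(as.Date("2021-04-01"), as.Date("2021-04-30"), by = "day")
stations <- c("Basel", "Bern", "Zurich", "Lugano")

oldpar <- par(mfrow = c(2, 2))
for (s in stations) {
  obs <- rlnorm(length(days), meanlog = 3, sdlog = 1)  # toy measurements
  mod <- obs * rlnorm(length(days), sdlog = 0.3)       # toy model values
  matplot(days, cbind(obs, mod), type = "l", lty = 1:2,
          main = s, xlab = "Date", ylab = "Concentration")
}
par(oldpar)
```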